AI Security
LLM safety
RLHF・DPO
ハルシネーション対策
LLMOps
LLMOps
AI safety
AIディフェンス研究所
https://jpsec.ai/blog/
Security camp感想
https://ryoryon66.hatenablog.com/entry/2022/10/03/103859
AIJack
https://github.com/Koukyosyumei/AIJack
PySyft
https://github.com/OpenMined/PySyft
Generative AI and Large Language Models for Cyber Security: All Insights You Need
https://arxiv.org/abs/2405.12750
Security of LLM Information Hub
https://tasuku-sasaki-lab.github.io/Tasuku-Sasaki.github.io/LLM-Security/
TrustLLM: Trustworthiness in Large Language Models
https://arxiv.org/abs/2401.05561
Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models
https://arxiv.org/abs/2403.04786v2
A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
https://arxiv.org/abs/2312.02003
Golden Gate Claude
https://www.anthropic.com/news/golden-gate-claude
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#/
Improving Alignment and Robustness with Circuit Breakers
https://arxiv.org/abs/2406.04313v2
OpenAI APIで、問題のある発言を検出するmoderationモデルを試してみた
https://dev.classmethod.jp/articles/openai-api-moderation-model/
ChatGPT "DAN" (and other "Jailbreaks")
https://github.com/0xk1h0/ChatGPT_DAN
Universal and Transferable Adversarial Attacks on Aligned Language Models
https://llm-attacks.org/
NeMo-Guardrails
https://github.com/NVIDIA/NeMo-Guardrails?tab=readme-ov-file#nemo-guardrails
LLMにおけるガードレールについて
https://zenn.dev/ayumuakagi/articles/llm_guardrails
【2024.9.9 AIアライメントネットワーク設立記念シンポジウム】#1「ALIGNの挑戦」髙橋恒一(ALIGN代表理事)
https://www.youtube.com/watch?v=_13ORbYifbU&t=910s
Singular Learning TheoryとAI Alignmentが結びつくのが面白い。そろそろ寝なければと思っていたのに、眠れなくなった。アラインメントとFree Energyが結びつき、脳のことにまで発想が広がってきた。
Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks
https://arxiv.org/abs/2410.20911
多文化・他言語対応の安全な大規模言語モデルの構築を目指して
https://www.youtube.com/watch?v=NLaayZ4v6Ag
LLMのアウトプットをバリデーションする関数が集まるGuardrails Hubを試す
https://zenn.dev/gaudiy_blog/articles/9de43ed4b260ce
BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
https://arxiv.org/abs/2410.09804
SLM as Guardian: Pioneering AI Safety with Small Language Models
https://arxiv.org/abs/2405.19795
LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild
https://arxiv.org/abs/2410.13919
Hacker Panel: What Hackers Can Tell You About AI Security
https://www.youtube.com/watch?v=eoXouUA1raQ
LLMjackingがDeepSeekを標的にする
https://sysdig.jp/blog/llmjacking-targets-deepseek/
OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities
https://arxiv.org/abs/2502.15797
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
https://arxiv.org/abs/2502.16776
AIセーフティに関するレッドチーミング手法ガイド Guide to Red Teaming Methodology on AI Safety
https://aisi.go.jp/effort/effort_framework/guide_to_red_teaming_methodology_on_ai_safety/